Arrangement and method for audio source tracking

ABSTRACT

An arrangement and a method for localizing active speakers in a video conference include a localization device that locates at least one microphone relative to the camera, while the at least one microphone in turn localizes the positions of audio sources relative to itself. Because the microphones in a video conference are usually positioned close to the audio source, the distance between the table microphone and the audio source is small relative to the spacing between the microphones. Thus, the microphones are able to determine the positions of the audio sources with a higher resolution than if they were placed close to the camera. Once the respective positions of the microphones relative to the camera and to the audio source are known, the position of the audio source relative to the camera is determined by means of geometrical calculations.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to signal source localization, in particular to an arrangement and a method for spatially localizing active speakers in a video conference.

2. Discussion of the Background

Signal localization is used in several applications, perhaps the best known of which is TV program production. In debate programs, for example, it is important for the viewer's experience and intelligibility that the active camera is pointing at, and preferably zooming in on, the current speaker. Traditionally, this has been handled manually by a producer. In other applications where cameras and microphones capture the view and sound of a number of people, it might be impossible or undesirable to have a dedicated person control the production.

One example of such an application is automatic camera pointing in video conferencing systems. A typical situation at an end-point in a video conference call is a meeting room with a number of participants sitting around a table watching the display device of the end-point, while a camera positioned near the display device captures a view of the meeting room. If there are many participants in the room, it may be difficult for those watching the view of the meeting room at the far-end side to determine who is speaking or to follow a discussion between several speakers. Thus, it would be preferable to localize the active speaker in the room and automatically point and/or zoom the camera onto that participant. Automatically orienting and zooming a camera toward a given position within its reach is well known in the art and will not be discussed in detail. The problem is to provide a sufficiently accurate localization of the active speaker, both in space and in time, to allow acceptable automatic video conference production.

Known audio source localization arrangements use a plurality of spatially spaced microphones and are often based on the determination of a delay difference between the signals at the outputs of the receivers. If the positions of the microphones and the delay differences between the propagation paths from the source to the different microphones are known, the position of the source can be determined. If two microphones are used, it is possible to determine the direction with respect to the baseline between them. If three microphones are used, it becomes possible to determine the position of the source in a 2-D plane. If more than three microphones, not placed in a single plane, are used, it becomes possible to determine the position of a source in three dimensions.

One example of audio source localization is shown in U.S. Pat. No. 5,778,082. This patent teaches a method and a system using a pair of spatially separated microphones to obtain the direction or location of an audio source. By detecting the beginning of the respective microphone signals representing the sound of the same audio source, the time delay between the audio signals may be determined, and the distance and direction to the audio source may be calculated.

In these and other known solutions to audio localization, the microphones used for direction and distance calculations are placed close to the camera. The camera is usually placed on top of the screen, beyond the end of the conference table. At least some of the participants will therefore be seated at a long distance (r) from the microphone setup. This setup has some disadvantages, as discussed below.

Due to the long distance between the speakers and the microphone setup, the expected spread of direction angles is small, and the spread of sound arrival time differences is correspondingly small. This reduces the accuracy of the localization algorithm. Yet it is precisely because of the long distance r that the algorithm needs to be precise.

One way of increasing the arrival time differences is to increase the distance between the microphones, denoted d. However, prior art has shown that d cannot be increased too much, as the signals at the different microphones tend to become uncorrelated when d is too large. Prior art has shown that a distance d of 20-25 cm provides the best results.

In particular, the calculation of the distance is prone to errors in traditional systems, as this distance is calculated from a small angle difference between relatively closely spaced microphone pairs. Thus, this method assumes that the speaker is in the near field of the microphone system, which in many cases is a questionable assumption.

The level of the direct sound (which is the sound used for calculating the direction) is inversely proportional to the distance r. Due to the long distance between the speaker and the microphones, the signal from the speaker will be weak, and therefore sensitive to background noise and to the self-noise of the microphone and electronics.

Due to the long distance, reflections of the sound from the speaker may reach the microphone setup at almost as high a level as the direct sound. Therefore, incorrect and inaccurate decisions can be made.

These disadvantages will always be a hindrance, but can be compensated for by integrating the audio over a long timeframe. However, this in turn has the disadvantage of a slowly responding system, which is a typical weakness of existing audio tracking systems.

SUMMARY OF THE INVENTION

Accordingly, an object of this invention is to provide a novel arrangement and a method of localizing a position of an audio source relative to a camera by determining the position of the audio source relative to one or more microphone(s) or array(s) of microphone elements, and geometrically deriving a first distance and/or direction between the camera and the audio source from the position of the audio source relative to one of the one or more microphone(s) or array(s) and a second distance and/or direction between the camera and one of the one or more microphone(s) or array(s).

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete appreciation of the invention and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:

FIG. 1 illustrates the geometry of an example of determining an angle and a distance between a camera and an audio source in the vertical plane,

FIG. 2 illustrates the geometry of an example of determining an angle with respect to a pair of microphones receiving acoustic signals from a source under a far-field assumption, and

FIG. 3 is a block diagram illustrating a video conference system according to an embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring now to the drawings, wherein like reference numerals designate identical or corresponding parts throughout the several views, and more particularly to FIG. 1 thereof, which illustrates one possible embodiment of the present invention. In FIG. 1, an arrangement used to localize the position of an audio source, participant 1, includes at least one microphone device 3 positioned separately from a camera 2. A distance "b" from the microphone device 3 to the participant 1 is less than a distance "a" from the microphone device 3 to the camera 2, thus allowing a more accurate near-field assumption.

A localization device, preferably placed as close to the camera in use as possible, locates one or more microphones that are preferably positioned as close to the participants as possible, while the microphone(s) (from now on referred to as the table microphone) in turn localize the audio source relative to their own position(s). The table microphone is provided with two or more microphone elements; alternatively, two or more separate table microphones can be utilized. As the table microphone is positioned close to the audio source, the distance between the table microphone and the audio source is small relative to the spacing between the microphone elements. Thus, the table microphone is able to determine the position of the audio source with a higher resolution and speed than if it were placed close to the camera.

When the respective positions of the table microphone relative to the camera and to the audio source are known, it is quite straightforward to find the position of the audio source relative to the camera. In this way, the accuracy of the result depends less on the placement of the audio source relative to the camera than on how close the table microphone is to the audio source, and on the accuracy and speed of the localization of the table microphone relative to the camera. The latter is far more controllable than the direct relationship between camera and audio source.

As already indicated, the idea is to combine two or more coordinate systems to locate the active speaker. One or more coordinate systems will be positioned at the camera side, and one or more at the microphone side. The position and orientation of the table microphone relative to the camera can be determined by manual measurements (in the case of a fixed table microphone position), by some kind of pattern recognition, by using a signal source such as sound, IR, or RF on the table microphone, or by letting the camera side have one or more signal sources which can be picked up by the table microphone. The invention utilizes the fact that the relative position between camera and table microphone is likely to be determined more accurately than the position of an audio source detected directly relative to the camera. Further, the detection equipment is placed close to the participants to be tracked, allowing near-field calculations instead of far-field calculations in order to obtain accurate measurements, after which the direction and distance of this equipment relative to the coordinate system of the camera are calculated. Finally, these calculations are combined to find the direct direction and distance from the camera to the participant.

One way of calculating an audio source direction, according to an embodiment of the present invention, is illustrated in FIG. 2. The time delay t between the acoustic signals reaching MIC B (22) and MIC A (21) is determined according to the state of the art, e.g. by signal onset detection as described in U.S. Pat. No. 5,778,082, or by cross-correlating the impulse responses of the acoustic paths to MIC B (22) and MIC A (21), respectively, as described in International patent application No. WO 00/28740.

Once the time delay t is determined, the bearing angle θ of the source C (23) relative to MIC B (22) and MIC A (21) may be determined according to

$$\theta = \arcsin\left(\frac{v \cdot t}{D}\right)$$

where v is the velocity of sound, t is the time delay, and D is the distance between the table microphones. This method of estimating the direction of the acoustic source is based upon a far-field approximation, wherein the acoustic signals are assumed to reach MIC A (21) and MIC B (22) in the form of a flat or plane wave. If the assumption of plane waves is not appropriate for a particular application, other techniques may be used for determining the direction or location of source C (23) with respect to MIC A (21) and MIC B (22). Such techniques may include, for example, incorporating additional microphones into the system and generating delays corresponding to the differences in arrival times of signals at the additional pairs of microphones according to the method described above. The multiple time delays may then be used, according to known techniques, to determine the direction or location of the source C (23).
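By way of illustration, the far-field bearing computation above can be sketched in a few lines of Python. The plain cross-correlation used here as the time delay estimator, and the function and parameter names, are illustrative assumptions; an actual system would use one of the cited onset-detection or impulse-response methods.

```python
import numpy as np

def bearing_from_pair(sig_a, sig_b, fs, mic_spacing, v_sound=343.0):
    """Estimate the bearing angle (radians) of a source relative to a
    two-microphone pair, under the far-field (plane-wave) assumption.

    sig_a, sig_b : sampled signals from MIC A and MIC B
    fs           : sampling rate in Hz
    mic_spacing  : distance D between the microphones in metres
    """
    # Full cross-correlation; the peak lag is the arrival-time difference.
    corr = np.correlate(sig_b, sig_a, mode="full")
    lag = np.argmax(corr) - (len(sig_a) - 1)   # delay in samples
    t = lag / fs                               # delay in seconds
    # theta = arcsin(v * t / D); clamp the argument against noisy estimates.
    arg = np.clip(v_sound * t / mic_spacing, -1.0, 1.0)
    return np.arcsin(arg)
```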

The above-described method estimates the direction of an audio source in one plane only, under a far-field consideration. To achieve a three-dimensional estimate with this approach, at least a third microphone or microphone element, MIC C (not shown), not aligned with the other two, must be added. MIC C, together with MIC A (21) and MIC B (22), constitutes two additional microphone pairs.

To obtain the position of the audio source relative to the table microphone while taking near-field considerations into account, a more sophisticated method may be required. An example is the Maximum Likelihood (ML) localization method, e.g. as described in "Acoustic localization of voice sources in a video conferencing environment", 1997, by Erik Leenderts. The ML method takes advantage of the statistical benefit of combining all possible microphone pairs. The purpose of this method is to find the most likely source position by using all the delay information that the table microphone arrangement can provide (by some Time Delay Estimation method, e.g. according to U.S. Pat. No. 5,778,082), combined with expected time delays for a number of positions.

For every point P = (x_p, y_p, z_p) in a room, associated expected time delays can be calculated for every microphone pair. For the pair consisting of microphones M_i and M_k, the relative delay seen from P, referred to as τ_ik(P), can be calculated exactly when the microphone positions are known. This calculation is well known in the art and will not be described in detail here. The method assumes that if P is at a different place than the source S₀, τ_ik(P) differs from τ_ik. Using N_mics microphones, up to

$$N_{pairs} = \binom{N_{mics}}{2}$$

different microphone pairs can be constructed, each with an associated expected time delay per P. These estimates can be combined to create an error-placement function E(P) for all positions P in the room:

$$E(P) = \sum_{i=1}^{N_{mics}-1}\sum_{k=i+1}^{N_{mics}} \left(\tau_{ik}(P) - \hat{\tau}_{ik}\right)^2$$

where $\hat{\tau}_{ik}$ is the estimated time delay for M_i and M_k. This function can be expected to produce a minimum at P = S₀.

If the exact source position is found, then P = S₀, and the error function becomes

$$E(S_0) = \sum_{i=1}^{N_{mics}-1}\sum_{k=i+1}^{N_{mics}} \left(\tau_{ik} - \hat{\tau}_{ik}\right)^2$$

which in an ideal environment would result in E(S₀) = 0.

The method described makes it possible to combine all microphone pairs without any geometrical errors being introduced.

Due to noise and reverberation, some delay estimates will be more reliable than others. Some estimates may even turn out not to be useful at all. If the reliability of each Time Delay Estimation (TDE) were known, a weighting function could be included in the error function:

$$E(P) = \sum_{i=1}^{N-1}\sum_{k=i+1}^{N} \beta_{ik}\left(\tau_{ik}(P) - \hat{\tau}_{ik}\right)^2$$

where β_ik is the weighting parameter for the delay estimate $\hat{\tau}_{ik}$.

Because some delay estimates may now be completely rejected, it must be checked whether the remaining delay estimates are geometrically able to locate the source. If so, the estimate will be considerably more accurate than if all delay estimates had been taken into account. If not, localization would have been inaccurate anyway.

How to find β_ik requires a thorough investigation and will not be considered any further here.

Finding the minimum point of the E(P) function, and thereby the most likely audio source position, can be done by calculating E-values for a set of P's and finding the minimum among these, or by the use of gradient-search methods.

If a pre-defined selection of possible and probable source positions (relative to the table microphone position) is used, all τ_ik(P) values can be calculated before localization is performed. When delays are estimated, these can be compared with the pre-calculated point delays in order to find the minimum point of the E-function. If the potential points are separated by 10 cm in all directions, the system can be expected to miss the actual source by less than $\sqrt{5^2 + 5^2 + 5^2} \approx 8.7$ cm.
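A minimal Python sketch of this grid-based ML search, including the optional weighting β_ik introduced above, might look as follows; the data structures and helper names are assumptions, and a practical implementation would precompute τ_ik(P) once for the fixed grid, as described.

```python
import itertools
import numpy as np

V_SOUND = 343.0  # speed of sound in m/s

def expected_delay(p, mic_i, mic_k):
    """Exact relative delay tau_ik(P) for a candidate point p (near field)."""
    return (np.linalg.norm(p - mic_i) - np.linalg.norm(p - mic_k)) / V_SOUND

def localize(mics, measured, grid, weights=None):
    """Pick the grid point minimizing the (optionally weighted) error
    E(P) = sum over pairs of beta_ik * (tau_ik(P) - tau_hat_ik)^2.

    mics     : (N, 3) array of microphone positions
    measured : dict mapping pair (i, k) -> estimated delay tau_hat_ik
    grid     : iterable of candidate source positions (3-vectors)
    weights  : optional dict mapping (i, k) -> reliability beta_ik
    """
    pairs = list(itertools.combinations(range(len(mics)), 2))
    best, best_err = None, np.inf
    for p in grid:
        p = np.asarray(p, dtype=float)
        err = 0.0
        for (i, k) in pairs:
            if (i, k) not in measured:   # rejected / unreliable estimate
                continue
            beta = weights.get((i, k), 1.0) if weights else 1.0
            err += beta * (expected_delay(p, mics[i], mics[k])
                           - measured[(i, k)]) ** 2
        if err < best_err:
            best, best_err = p, err
    return best
```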

The expected participant area in a conference situation is limited. If, for example, the participants are located within 1 to 5 meters in front of the table microphone, and at most 3 meters to each side, this means generating (400/10+1)*(600/10+1) = 2501 points when using a 10 cm grid size. Another reasonable approximation in a videoconference application is to expect the audio source to be located between 100 cm and 180 cm above the floor.

Under these conditions the total number of calculation points, still with a 10 cm grid size, becomes 2501*(80/10+1) = 22509.

The area of "legal" source positions may be further restricted, but this still leaves several thousand E-values to be calculated. For this reason, gradient search can be expected to provide higher time efficiency.

There are many other possible ways of determining the position of an audio source relative to the table microphone, most of which increase in accuracy and resolution the closer the table microphone is to the audio sources (r), relative to the distance between the microphone elements (D). However, it is to be noted that if D becomes too large, the sound received from the same audio source at the respective microphones will differ too much (due to reflections etc.), so that delay measurements become impossible. Thus, D has an upper operating limit. Prior art shows that the optimal distance D is in the range of 20-25 cm.

The present invention transfers the advantages of operating in the near field to the overall far-field calculation of the position of the audio source relative to the camera. The calculation methods already mentioned may of course also be utilized in the far-field part, i.e. in determining the position of the table microphone relative to the camera, but in this case the positions involved are more controllable, allowing the calculation to be faster and more accurate even though it is a far-field calculation. In addition, as opposed to the microphone/audio-source case, this positioning process is not restricted to a one-way calculation: the camera may detect the position of the table microphone, just as the table microphone may detect the position of the camera. Further, because the table microphone and the camera would be stationary in most applications, less sophisticated and less speed-demanding methods are required. In some applications, when both table microphone and camera are fixed, even predefined values for distance and direction may be used.

In a preferred embodiment of the present invention, all the positioning functions are provided by the table microphone in order to limit the adjustment of other equipment associated with the video conferencing equipment. In this embodiment, the only adjustment apart from those in the table microphone is an auxiliary sound source mounted on, or close to (or in a known or detectable relation to), the camera. The table microphone is adapted to recognize a known signal from this auxiliary sound source. The auxiliary sound source may emit sound with a frequency outside the human audible frequency range and/or with an amplitude not detectable by the human ear, in order not to interfere with the ongoing conference. The auxiliary sound source may also be a loudspeaker of the videoconference equipment in use. In that case, the position of the loudspeaker relative to the camera must be known, or detected each time.

As earlier indicated, when the audio source to be located is controllable, the localization can be much more accurate and less time-consuming than for a non-controllable audio source such as a speaker. The propagation delay from the loudspeaker to the microphone system can be derived from the corresponding transfer function. A commonly used technique for measuring the transfer function from a loudspeaker to a microphone is the Maximum-Length Sequence (MLS) technique. The MLS signals are a family of signal types with certain characteristics. The most important characteristic in this context is the fact that when an MLS signal is fed to the input of a system, its cross-correlation with the system output gives exactly the system impulse response. This is derived from the following set of equations, where h is the impulse response of the system, y is the output signal of the system having an MLS signal x as input, r is the cross-correlation function and δ is the delta function:

$$\begin{aligned}
y &= h * x \\
y(n) &= \sum_{k=-\infty}^{\infty} h(k)\,x(n-k) \\
r_{yx}(l) &= \sum_{m=-\infty}^{\infty} y(m)\,x(m-l) \\
r_{yx}(l) &= \sum_{m=-\infty}^{\infty} x(m-l) \sum_{k=-\infty}^{\infty} h(k)\,x(m-k) \\
r_{yx}(l) &= \sum_{k=-\infty}^{\infty} h(k) \sum_{m=-\infty}^{\infty} x(m-l)\,x(m-k) \\
r_{yx}(l) &= \sum_{k=-\infty}^{\infty} h(k)\,r_{xx}(l-k) \\
r_{yx}(l) &= h * r_{xx}(l) \\
r_{yx}(l) &= h * \delta(l) \\
r_{yx}(l) &= h(l)
\end{aligned}$$

When an MLS signal is input to the auxiliary sound source (e.g. a loudspeaker) of the system of the present invention, and the respective outputs of the microphones are measured, the impulse responses of the systems consisting of auxiliary sound source, acoustic environment and microphone can be determined. The impulse response discloses the absolute delay of the signal, implicitly also disclosing the absolute distance between sound source and microphone. The relative delays between the receiving times of the signal at the respective microphones or microphone elements, together with the distances between them, enable estimation of the direction to, and orientation of, the table microphone relative to the sound source.
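As a rough illustration of this measurement, the following Python sketch generates an MLS excitation and recovers the absolute delay, and hence the distance, from a recorded microphone signal. The sampling rate, sequence order, and the simple peak-picking of the direct path are assumptions for the sketch, not part of the claimed method.

```python
import numpy as np
from scipy.signal import max_len_seq, fftconvolve

FS = 48000       # assumed sampling rate, Hz
V_SOUND = 343.0  # speed of sound in m/s

# A maximum-length sequence of order 16, mapped to a +/-1 excitation signal.
mls = max_len_seq(16)[0].astype(float) * 2.0 - 1.0

def distance_from_response(mic_signal, excitation=mls, fs=FS):
    """Estimate the absolute loudspeaker-to-microphone distance.

    Cross-correlating the recorded output with the MLS input yields
    (up to scale) the impulse response h(l); its strongest peak is
    assumed here to be the direct path.
    """
    # Cross-correlation via convolution with the time-reversed excitation.
    h = fftconvolve(mic_signal, excitation[::-1], mode="full")
    lag = np.argmax(np.abs(h)) - (len(excitation) - 1)  # lag in samples
    delay = lag / fs                                    # absolute delay, s
    return delay * V_SOUND                              # distance in metres
```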

An alternative embodiment of the present invention utilizes the visual capabilities of the camera. The table microphone is then provided with an easily recognizable shape or pattern that is pre-stored and accessible to the camera. In this way, the camera itself (or the control unit) is enabled to calculate the position of the table microphone by deriving the size and placement of the recognizable pattern within the view captured by the camera. Alternatively, the pattern may include two or more controllable light sources to assist the camera in recognizing and positioning the table microphone. The control unit may also be adapted to measure the time taken by the light to travel from the table microphone to the camera, and thereby derive the position.
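One conceivable realization of this pattern-based positioning is sketched below using OpenCV's pose estimation. The pattern geometry, the calibrated camera matrix, and the function name are assumptions for illustration; the time-of-flight variant is not covered here.

```python
import cv2
import numpy as np

# Assumed 3-D layout of the recognizable pattern (e.g. light sources) on
# the table microphone, in the microphone's own coordinate frame (metres).
PATTERN_POINTS = np.array([[0.00, 0.00, 0.0],
                           [0.10, 0.00, 0.0],
                           [0.00, 0.10, 0.0],
                           [0.10, 0.10, 0.0]], dtype=np.float32)

def microphone_pose(image_points, camera_matrix, dist_coeffs):
    """Recover the table-microphone pose relative to the camera from the
    pattern's detected pixel coordinates (one 2-D point per 3-D point)."""
    ok, rvec, tvec = cv2.solvePnP(PATTERN_POINTS, image_points,
                                  camera_matrix, dist_coeffs)
    if not ok:
        return None
    distance = float(np.linalg.norm(tvec))  # camera-to-microphone distance
    return rvec, tvec, distance
```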

In still another embodiment of the invention, the camera and the table microphone use RF (Radio Frequency) detection to localize each other in a local localization system. Of course, the relative position between table microphone and camera may also be fixed.

When the relative positions between camera and table microphone, as well as between table microphone and audio source, are found, only a tedious geometrical calculation remains to find the relative position between camera and audio source. Referring to FIG. 1, this determination includes calculating the angle α₃ and the distance c, given the angles α₁ and α₂ and the distances a and b. Geometrical considerations imply the following expressions for the distance c and the angle between camera and audio source in the vertical plane:

$$c = \sqrt{a^2\tan^2\alpha_1 + b^2\tan^2\alpha_2}$$

$$\alpha_3 = \arcsin\left(\frac{a\sin\alpha_1 - b\sin\alpha_2}{c}\right)$$

The corresponding values for the horizontal plane can be calculated in exactly the same way. Given the position of the camera, the three-dimensional position of the audio source may then be easily calculated, e.g. by the Pythagorean theorem.
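The two expressions above transcribe directly into code; the following Python sketch simply evaluates them for given measurements (angles in radians). The clamp before the arcsine is a defensive addition, not part of the stated formulas.

```python
import math

def camera_to_source(a, b, alpha1, alpha2):
    """Distance c and vertical angle alpha3 from the camera to the audio
    source, given the legs a, b and angles alpha1, alpha2 of FIG. 1."""
    c = math.sqrt(a**2 * math.tan(alpha1)**2 + b**2 * math.tan(alpha2)**2)
    # Clamp against rounding noise before taking the arcsine.
    s = max(-1.0, min(1.0, (a * math.sin(alpha1) - b * math.sin(alpha2)) / c))
    alpha3 = math.asin(s)
    return c, alpha3
```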

With the information about the direction to the participant(s) (1), it is possible for a motorized camera (2) to be positioned in the correct direction. With information about the distance, the correct zoom ratio and focus can be adjusted.

FIG. 3 illustrates an example of another embodiment of the claimed invention. When one of the participants (320) or (321) at station A (370) begins to speak, the acoustic signals generated by the participant's speech are acquired by the table microphone (322), sent to the control unit (301) where they are processed in known fashion, and transmitted via the transmission system (310) to station B (371). At station B, the received acoustic signals are reproduced over the loudspeakers (341) and (342).

The acoustic signals generated by the speaking participant (320) or (321) are also acquired by the microphones in the microphone array (365). The acquired signals are sent to the control unit (301), where signals from various pairs of the microphones are preferably processed, and the most likely position of the speaking participant is determined according to the method described above. Through a similar determination of the relative direction and distance between the table microphone and an auxiliary sound source in the camera, the relative direction and distance between the camera (360 or 364) and the sound source are determined by means of geometrical calculations. This information is then used to aim or adjust the direction and/or the zoom of the camera automatically.

For example, the determined direction may be used directly or indirectly to adjust the orientation of the camera in order to point at the position of the audio source. Automatic zooming may be carried out by associating distances with zooming quantities expressed as percentages relative to an initial view. The association between distances (or intervals of distances) and percentages may be stored in a table in the control unit, available for ad hoc inquiries when a new audio source is detected or when an active speaker is moving.
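Such a distance-to-zoom table could be realized, for example, as follows; the interval bounds and percentages below are purely hypothetical values chosen for illustration.

```python
import bisect

# Hypothetical association of distance intervals (upper bounds, metres)
# with zoom quantities as percentages of the initial view.
DISTANCE_BOUNDS = [1.5, 2.5, 4.0, 6.0]       # metres
ZOOM_PERCENT    = [100, 140, 190, 250, 320]  # one more entry than bounds

def zoom_for_distance(distance_m):
    """Look up the zoom percentage for a camera-to-source distance."""
    return ZOOM_PERCENT[bisect.bisect_left(DISTANCE_BOUNDS, distance_m)]
```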

Alternative embodiments of the invention may also combine audio detection with visual signatures for fine-tuning the camera orientation and zooming. After audio detection, the active speaker is most likely within the view captured by the camera. The camera or the control unit then identifies the active speaker within the view by means of a pre-stored visual signature of him/her, and if the zooming/orientation of the camera relative to the active speaker is found to be inaccurate, it is adjusted according to the position of the identified active speaker within the view. A further improvement would be to associate the visual signatures with corresponding audio signatures. If more than one visual signature should come within the captured view after audio detection, the camera or the control unit would then know which of the visual signatures to choose when fine-tuning, by investigating the audio signature of the active speaker. This fine-tuning by means of visual and/or audio signatures should preferably be smoothly integrated with the camera movements due to audio detection, to prevent disruptive discontinuous movements.

There are several advantages to utilizing the method and/or arrangement according to the present invention, some of which are discussed in the following.

Firstly, the d/r ratio will increase as r is reduced. This means that any angle difference implies larger arrival time differences. Further, the effective spread of angles in the horizontal plane is increased, up to 360 degrees. This implies even larger arrival time differences.

Secondly, the signal from the speaker will be stronger, and so will the signal-to-reverberation ratio, allowing improved calculations.

Thirdly, since r is reduced, any calculated error in the time difference, and therefore in the angle, will have a proportionally (to r) lower error on the actual position.

Further, the increased d/r ratio implies that a true near-field assumption can be made, and the calculation of distance will be more accurate.

Given these advantages, it is possible to find the relative position between the microphone system and the speaker with higher precision and increased speed.

However, the position of the microphone system relative to the camera still has to be determined with high accuracy. Using audio for this localization, with a loudspeaker placed at the camera, this is a simplified problem, because this system tends to be stationary (not moving). Therefore, all calculations can be integrated over a long time, obtaining very accurate measurements.

The audio emitted from the loudspeaker is controllable, and by selecting a signal with proper statistics, it will be easy to accurately measure the arrival time differences, and therefore the direction/angle.

The controllability of the loudspeaker allows finding the absolute time for the audio to propagate from the loudspeaker to the microphone system. Since the speed of sound is known, the absolute distance can be found. Therefore, no questionable assumption about a near field between loudspeaker and microphone system is necessary.

Proper algorithms, for example the MLS (maximum-length sequence) technique, are very robust to noise; therefore, the long distance between loudspeaker and microphone system (i.e. the low signal-to-noise ratio) will not represent a big challenge. The MLS technique is also able to distinguish between the direct sound and the reflected sound; therefore, the signal-to-reverberation ratio will not represent a big challenge either.

CLAIMS

1. A method of localizing a position of an audio source relative to a camera, comprising the steps of: determining a position of the audio source relative to at least one of a microphone and an array of microphone elements; geometrically deriving a second distance and direction between the camera and one of the at least one of the microphone and array of microphone elements; and geometrically deriving a first distance and direction between the camera and the audio source from the position of the audio source relative to one of the at least one of the microphone and array of microphone elements and said second distance and direction.
 2. The method according to claim 1, wherein determining a position of the audio source relative to the at least one of the microphone and array of microphone elements includes: detecting a respective time difference of receiving audio signals from the audio source to the at least one of the microphone and array of microphone elements for at least one of a pair of microphones, a pair of arrays of microphone elements, and a pair of a microphone and an array of microphone elements.
 3. The method according to claim 1, wherein said second distance or direction between the camera and one of the at least one of the microphone and array of microphone elements is fixed.
 4. The method according to claim 1, wherein deriving said second distance or direction between the camera and one of the at least one of the microphone and array of microphone elements, further comprises: transmitting a sound signal from a known or detectable position relative to the camera; respectively receiving the sound signal in at least two of the microphones and arrays of microphone elements; processing the received sound signals for calculating said second distance or direction between the camera and one of the at least one of the microphone and array of microphone elements.
 5. The method according to claim 1, wherein deriving said second distance or direction between the camera and one of the at least one of the microphone and array of microphone elements, further comprises: providing said one of the at least one of the microphone and array of microphone elements with a recognizable pattern; identifying the recognizable pattern within the view captured by the camera; and determining said second distance or direction based on a size or a position of the recognizable pattern within the view.
 6. The method according to claim 1, wherein determining the position of the audio source relative to the at least one of the microphone and array of microphone elements further comprises: calculating a first time difference for receiving audio signals from the audio source to the respective microphones and arrays of microphone elements of each possible pair of microphones and arrays of microphone elements for each point in a predefined set of points; measuring a second time difference for receiving audio signals from the audio source to the respective microphones and arrays of microphone elements of each possible pair of microphones and arrays of microphone elements; calculating values of an error function for each point of the predefined set of points by summing the squared differences between the corresponding first and second time differences for each possible pair of microphones and arrays of microphone elements; and selecting a point associated with a least value of the error function as the position of the audio source.
 7. The method according to claim 1, further comprising: executing a look up with said first distance or direction in a table associating various distances and directions with corresponding camera zooming quantities and orientations, respectively; and zooming or orienting the camera according to a result of said look up.
 8. An arrangement for localizing a position of an audio source relative to a camera that determines a position of the audio source relative to at least one of a microphone and an array of microphone elements, including: a control unit that geometrically derives a first distance or direction between the camera and the audio source from the position of the audio source relative to at least one of the microphone and array of microphone elements and a second distance or direction between the camera and one of the at least one of the microphone and array of microphone elements.
 9. The arrangement according to claim 8, wherein the control unit is further configured to detect a respective time difference of receiving audio signals from the audio source to the at least one of the microphone and array of microphone elements for at least one of a pair of microphones, a pair of arrays of microphone elements, and a pair of a microphone and an array of microphone elements.
 10. The arrangement according to claim 8, wherein said second distance or direction between the camera and one of the at least one of the microphone and array of microphone elements is fixed.
 11. The arrangement according to claim 8, further comprising: a sound signal transmitter positioned in a known or detectable position relative to said camera; and a sound signal transceiver that receives sound in at least two of the microphones and arrays of microphone elements and transmits the received sound signals to the control unit, which is configured to process the received sound signals to calculate said second distance or direction between the camera and one of the at least one of the microphone and array of microphone elements.
 12. The arrangement according to claim 8, wherein said one of the at least one of the microphone and array of microphone elements is provided with a recognizable pattern, the camera or the control unit is configured to identify the recognizable pattern within the view captured by the camera, and the control unit is configured to determine said second distance or direction based on a size or a position of the recognizable pattern within the view.
 13. The arrangement according to claim 9, wherein the control unit is further configured to calculate a first time difference for receiving audio signals from the audio source to the respective microphones and arrays of microphone elements of each possible pair of microphones and arrays of microphone elements for each point in a predefined set of points, measure a second time difference for receiving audio signals from the audio source to the respective microphones and arrays of microphone elements of each possible pair of microphones and arrays of microphone elements, calculate values of an error function for each point of the predefined set of points by summing the squared differences between the corresponding first and second time differences for each possible pair of microphones and arrays of microphone elements, and select a point associated with a least value of the error function as the position of the audio source.
 14. The arrangement according to claim 8, where the control unit includes a look up table associating various distances and directions with corresponding camera zooming quantities and orientations, respectively, and the control unit is configured to zoom or orient the camera according to a zooming quantity or orientation associated with said first distance or direction. 